Multi-core processing for parallel speech-to-text processing

ABSTRACT

This specification describes technologies relating to multi-core processing for parallel speech-to-text processing. In some implementations, a computer-implemented method is provided that includes the actions of receiving an audio file; analyzing the audio file to identify portions of the audio file as corresponding to one or more audio types; generating a time-ordered classification of the identified portions, the time-ordered classification indicating the one or more audio types and position within the audio file of each portion; generating a queue using the time-ordered classification, the queue including a plurality of jobs where each job includes one or more identifiers of a portion of the audio file classified as belonging to the one or more speech types; distributing the jobs in the queue to a plurality of processors; performing speech-to-text processing on each portion to generate a corresponding text file; and merging the corresponding text files to generate a transcription file.

BACKGROUND

The present disclosure relates to multi-core processing for parallel speech-to-text processing.

Speech-to-text systems generate a text transcript from audio content. Speech-to-text techniques typically use speech recognition to identify speech from audio. Speech-to-text can be used for several speech recognition applications including, for example, voice dialing, call routing, data entry, and dictation.

A speech-to-text recognition system typically digitizes an audio signal into discrete samples. Those discrete samples are generally processed to provide a frequency domain analysis representation of the original input audio signal. With the frequency domain analysis of the signal, a recognition system maps the frequency domain information into phonemes. Phonemes are the phonetic sounds that are the basic blocks used to create words in every spoken language. For example, the English written language has an alphabet of 26 letters. However, the vocabulary of English phonemes is typically a different size. The mapping provides a string of phonemes mapped to the frequency domain analysis representation of the original input signal. Speech detection processing resolves the phonemes using a concordance or a dictionary.

A typical parallel processing technique includes a split function that physically divides an audio file into roughly equal portions. The split function intelligently divides the audio file, e.g., so that the division does not split words. The split points occur in intervals with no sound or during any intervals a signal classifier identifies as non-dialogue. The split function accepts an optional exclusion interval file to identify and filter non-dialogue from the audio file. The split function separates the entire audio file into portions having approximately the same amount of transcription data. The processes complete at approximately the same time. The portions are processed into text files.

Once text files are generated, a merge function that accepts partial speech-to-text transcripts merges separate text files from the processed portions into one transcription file. The merge function uses a master time portion index of start and end times for each portion. The split method generates the master portion index, which is used to sequence time codes for text files being merged. When the last portion has been processed to the last text file, the initiating master process then invokes the merge function to recombine the results. The output from this process is a single textual transcription of the original input signal.

SUMMARY

This specification describes technologies relating to multi-core processing for parallel speech-to-text processing.

In general, one aspect of the subject matter described in this specification can be embodied in computer-implemented methods that include the actions of receiving an audio file; analyzing the audio file to identify portions of the audio file as corresponding to one or more audio types, the one or more audio types including one or more speech types; generating a time-ordered classification of the identified portions, the time-ordered classification indicating the one or more audio types and position within the audio file of each respective portion; generating a queue using the time-ordered classification, the queue including multiple jobs where each job includes one or more identifiers of a respective portion of the audio file classified as belonging to the one or more speech types; distributing the jobs in the queue to processors for speech-to-text processing of the corresponding portion of the audio file; performing speech-to-text processing on each portion to generate a corresponding text file; and merging the corresponding text files to generate a transcription file, the text files merged in order based on the order in which the portions of the audio file occur within the audio file. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.

These and other embodiments can optionally include one or more of the following features. Distributing the processor job descriptors for the portions of the audio file in the queue occurs in a first-in-first-out order. The method further includes distributing one or more of the jobs in the queue before the generated queue is completed. The method further includes dividing the merging of the text files recursively amongst the processors.

The method further includes distributing the processors amongst one or more computing devices. The method further includes distributing the jobs in the queue to two or more remote locations, each remote location including multiple processors, each remote location connected to one or more other remote locations. The method further includes one or more audio types including data with spoken language. The method further includes partitioning the queue into units based on a specified amount of time; and partitioning the units in substantially equal amounts of time to each of the processors. The method further includes partitioning the queue into units based on a specified amount of data; and partitioning the units in substantially equal amounts of data to each of the processors.

The method further includes a client device identifying and classifying the portions of the audio file into the one or more audio types within the audio file. The method further includes partitioning the received audio file for identification and classification using multiple classifiers. The method further includes the classifiers including one or more of the following classifiers: dialogue, applause, music, silence, and ambient noise. The method further includes storing the transcription file. The method further includes generating portion descriptors for each portion. The method further includes the portion descriptors including metadata associated with amount of time and one or more classifiers associated with the portions of the audio file. The method further includes the time-ordered classification including determining a time interval for each portion of the identified portions.

In general, one aspect of the subject matter described in this specification can be embodied in methods, performed by a computer programmed to provide speech-to-text processing, that include the actions of receiving an audio file; analyzing the audio file to identify portions of the audio file as corresponding to one or more audio types, the one or more audio types including one or more speech types; generating a time-ordered classification of the identified portions, the time-ordered classification indicating the one or more audio types and position within the audio file of each respective portion; generating a queue using the time-ordered classification, the queue including a plurality of jobs where each job includes one or more identifiers of a respective portion of the audio file classified as belonging to the one or more speech types; distributing the jobs in the queue to a plurality of processors for speech-to-text processing of the corresponding portion of the audio file; performing speech-to-text processing on each portion to generate a corresponding text file; and merging the corresponding text files to generate a transcription file, the text files merged in order based on the order in which the portions of the audio file occur within the audio file.

Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. The system enhances speech-to-text transcription to provide accurate transcripts of an audio file in an efficient manner. Removing non-dialogue portions before speech-to-text processing decreases the amount of time for processing and improves the accuracy of the transcript by removing audio data that can trigger false positives. Additionally, using multi-core parallel processing reduces the amount of time to complete a transcription, while allowing portions of the audio file to be processed in larger portions. The system requires less computer readable storage space and less transfer time to process the portions than serial processing and other parallel speech-to-text techniques.

The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the invention will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustration of an example audio file partitioned into portions classified according to audio type.

FIG. 2 is a flow chart of an example method for speech-to-text processing audio files using multi-core parallel processing.

FIG. 3 is a diagram representing an example classification system for identifying portions of an audio file corresponding to one or more audio types.

FIG. 4 is a diagram representing an example system for generating a time-ordered classification of portions of an audio file.

FIG. 5 is a diagram representing an example system for distributing processor job descriptors for portions of an audio file to processors for speech-to-text processing.

FIG. 6 is a diagram representing an example system for distributing processor job descriptors for portions of an audio file to processors distributed over a network.

FIG. 7 is a diagram representing an example system for merging text files to generate a transcription file.

FIG. 8 is a block diagram of an exemplary user system architecture.

DETAILED DESCRIPTION

FIG. 1 is an illustration 100 of an example audio file 102 classified by audio types. An audio file can include different types of audio data. For example, a typical movie includes dialogue, a score, and sound effects. Often, there are periods of silence in a movie. The amount of time where speech is present in a movie can be minimal, e.g., “The Bear” where only a few moments of human speech are heard, or where the dialogue drives the plot, e.g., “A Man for All Seasons,” adapted from a Robert Bolt play. In either example movie's case, the entire movie does not have speech for every minute, and a percentage of the audio file for the movie would not provide useful data from speech-to-text processing.

The audio file 102 in FIG. 1 illustrates the divisions that can occur in an audio file for a movie or television show. In particular, the illustration 100 includes a time scale 104 and different identified audio types for portions of the audio file 102. For example, the illustration 100 identifies a music portion 106 and a dialogue portion 108. The identified types in the audio file 102 are classified into time ordered portions according to the time scale 104. For example, the music portion 106 begins at T1 and ends at T5 and the dialogue portion 108 begins at T5 and ends at T8. Identified portions can be non-contiguous. As shown in FIG. 1, the portions 106, 108 are non-contiguous. For example, the music portion 106 is split from T1 to T3 and from T4 to T5. FIG. 1 shows T3-T4, T6-T7, T13-T14 all indicating silence intervals. These intervals are detected when no signal is present or the signal is below a volume threshold. The system can use these silence intervals as candidate locations to partition the audio file 102 when no other music or non-dialogue signal is detected.
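The threshold-based detection of silence intervals described above can be sketched as follows. This is a minimal illustration only: the function name `find_silence_intervals`, the amplitude threshold, and the minimum duration are assumptions introduced for the example and are not part of the described system.

```python
from typing import List, Tuple


def find_silence_intervals(
    samples: List[float],
    sample_rate: int,
    threshold: float = 0.01,     # assumed amplitude threshold
    min_duration: float = 0.5,   # assumed minimum silence length in seconds
) -> List[Tuple[float, float]]:
    """Return (start_sec, end_sec) spans where |sample| stays below threshold."""
    intervals = []
    run_start = None  # index where the current quiet run began, if any
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The quiet run ended; keep it only if it is long enough.
            if (i - run_start) / sample_rate >= min_duration:
                intervals.append((run_start / sample_rate, i / sample_rate))
            run_start = None
    # Handle a quiet run that extends to the end of the signal.
    if run_start is not None:
        if (len(samples) - run_start) / sample_rate >= min_duration:
            intervals.append((run_start / sample_rate, len(samples) / sample_rate))
    return intervals
```

Each returned interval is a candidate location at which the audio file could be partitioned without splitting a word.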

The audio file 102 includes data with spoken language. In some implementations, the music portion 106 includes spoken language in addition to music, e.g., a rap song, or a musical with an interlude. Likewise, the spoken language can include lyrics in a song. Other portions can also include a spoken language component.

In some implementations, non-dialogue portions create artifacts that can mirror speech even if the sound is not speech. For example, music can result in speech-to-text processing false positives, e.g., a string of words “the” “the” “the”. Thus, music or sound effects can trigger speech-to-text processing.

FIG. 2 is a flow chart of an example method 200 for speech-to-text processing audio files using multi-core parallel processing. For convenience, the method 200 will be described with respect to a system that will perform the method 200.

The system receives 202 an audio file. The audio file may be received from a computer readable medium, e.g., a CD, DVD, or memory. The audio file can be, for example, a WAV, MP3, or other audio file. The audio file can be locally stored or retrieved from a remote location. The audio file can be received, for example, in response to a user selection of a particular audio file. In some implementations, a user uploads the audio file from a client device. The system can receive the audio file from an application. For example, a digital audio workstation can provide the audio file before or after digital signal processing.

The system analyzes 204 the audio file to logically identify portions of the audio file as corresponding to a particular audio type. In some implementations, the audio type includes speech. For example, the audio type may be a dialogue type. The system can identify portions with dialogue and provide no identifier for non-dialogue portions. Alternatively, the system can use a classifier to determine the audio type of particular portions of the audio file. For example, the audio types can include dialogue, silence, applause, and music. In some implementations, the audio types include more than one audio type, e.g., dialogue and applause or dialogue and music.

The system generates 206 a time-ordered classification of the identified portions. In some implementations, the system indicates the audio type and position within the audio file of each respective portion. For example, the system can associate a string of data to each portion, providing the beginning and end time of each portion.

The system generates 208 a queue that identifies the portions of the audio file classified as belonging to one of the speech types. The queue does not include identifiers for portions of the audio file that are not classified as belonging to one of the speech types. For example, the queue can identify portions of the audio file classified as having a speech type, e.g., the dialogue type. The queue can also identify portions that have at least the dialogue type. For example, the system can identify some portions as having both the dialogue type and the music type. In some implementations, the system copies and stores the portions that the queue identifies to a new location. Alternatively, the system can generate a list of pointers or other logical identifiers for portions in the audio file that the queue identifies.
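As a rough sketch of queue generation from a time-ordered classification, the following keeps only portions carrying at least one speech type and preserves their order in the audio file. The `Portion` class, the `SPEECH_TYPES` set, and `build_queue` are hypothetical names chosen for illustration; the queue here holds logical identifiers (start and end times) rather than copies of the audio data.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Portion:
    start: float                 # position in the audio file, in seconds
    end: float
    audio_types: FrozenSet[str]  # e.g. {"dialogue"} or {"dialogue", "music"}


# Assumption: which audio types count as speech types.
SPEECH_TYPES = frozenset({"dialogue"})


def build_queue(
    portions: Iterable[Portion],
    speech_types: FrozenSet[str] = SPEECH_TYPES,
) -> List[Portion]:
    """Return speech portions in the order they occur in the audio file."""
    ordered = sorted(portions, key=lambda p: p.start)
    # A portion qualifies if it has at least one speech type,
    # so "dialogue and music" portions are queued as well.
    return [p for p in ordered if p.audio_types & speech_types]
```

For example, a classification containing music, dialogue, dialogue-and-music, and silence portions would yield a queue with only the two portions that include the dialogue type.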

As shown in FIG. 1, the appended data can provide time codes for the time interval in the audio file 102, e.g., a dialogue portion beginning at T9 and ending at T12. In some implementations, the system positions each respective portion in the order it appears in the audio file 102. For example, the two portions of a dialogue portion in FIG. 1 would be placed in the order that they appear in the audio file 102 so that section T9-T10 appears before section T11-T12.

The system distributes 210 processor job descriptors for the portions of the audio file in the queue to processors for speech-to-text processing. Each job identifies or includes a particular portion of the audio data for processing. For example, the system can distribute one or more of the processor job descriptors for the portions to the processors in a first-in-first-out order from the queue. In some implementations, the system logically partitions some of the portions after generating the queue to determine file sizes appropriate for the processors and the lag associated with communication between the processor and the location of the file. For example, if the processor is in a remote location where lag associated with the communication between the processor and the location of the file will take longer than the speech-to-text processing, the system can determine processor job descriptors for multiple portions to send to the processor and logically partition at least some of the portions together to remove lag time. Alternatively, the system can monitor the processing speed of the queue and determine that at least one processor can process more data. The system can reroute or distribute job descriptors of the remaining portions to the processor instead of the originally intended processor with a larger amount of data remaining to be processed.
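A first-in-first-out distribution of job descriptors to a pool of workers can be sketched with Python's standard `concurrent.futures` module. The `transcribe` function below is a placeholder standing in for a real speech-to-text engine, and a thread pool is used only to keep the example small; `ProcessPoolExecutor` exposes the same interface and would be the closer match for multi-core processing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Job = Tuple[float, float]  # (start_sec, end_sec) identifying a portion


def transcribe(job: Job) -> str:
    """Placeholder: a real worker would run speech-to-text on the portion."""
    start, end = job
    return f"[text for {start}-{end}]"


def distribute(jobs: List[Job], workers: int = 4) -> List[str]:
    """Hand jobs to workers in queue order; each idle worker takes the next job."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() submits jobs in order and returns results in that same order,
        # regardless of which worker finishes first.
        return list(pool.map(transcribe, jobs))
```

Because an idle worker always takes the next job, a faster processor naturally ends up handling more portions, which approximates the rerouting behavior described above.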

In some implementations, the processors are in a single device. For example, a computing device may have two processors to perform speech-to-text processing or one or more distinct processors having multiple cores. Alternatively, the processors are separately located and connected through a network. For example, the system can be located in various points through a network. The network can be a local area network, a wide area network, or in a “cloud” across the Internet. The system can allow for multiple audio files to be uploaded for processing and designate various processors for each audio file. Alternatively, each processor can process any portion of any audio file as the processor becomes available to do so.

The system performs 212 speech-to-text processing on each portion to generate a corresponding text file. In particular, each processor processes the portions identified by the received job assignments. For example, for a given identified portion, the processor receives data from the audio file corresponding to the identified portion. The speech-to-text processing can map a frequency domain analysis representation of the portions to phonemes. The system can resolve the phonemes using a concordance or a dictionary in combination with one or more language models. In some implementations, each processing module can have its own concordance or dictionary stored locally. Alternatively, the system can store a single concordance or dictionary in a central location.

The system merges 214 the corresponding text files to generate a transcription file. For example, the system can merge the text files in order based on the sequence in which the portions occur within the audio file. In some implementations, the system merges text files as they are generated into larger text files. The system then merges the larger text files to generate the transcription file. Alternatively, the system merges the text files into the transcription file in a serial order with respect to the location of the portions in the audio file. For example, transcriptions of the portions T1-T3 and T4-T5 in FIG. 1 would appear in the same order as they do in the audio file 102.
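Since processors may finish in any order, a serial merge needs some key to restore the order of the audio file. The sketch below assumes each result is tagged with the start time of its portion; the function name and the newline separator are illustrative choices only.

```python
from typing import Iterable, Tuple


def merge_in_order(results: Iterable[Tuple[float, str]]) -> str:
    """Merge (start_time, text) results, given in completion order,
    into one transcript ordered by position in the audio file."""
    ordered = sorted(results, key=lambda item: item[0])
    return "\n".join(text for _, text in ordered)
```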

The system stores 216 the transcription file. The transcription file can be stored locally or remotely. In some implementations, the transcription file is transmitted to another location for storing. Additionally, the transcription file can be further processed, e.g., conversion from a .txt file to an .html file or performing a spell-check or grammar check on the transcribed text.

FIG. 3 is a diagram representing an example classification system 300 for identifying portions of an audio file corresponding to one or more audio types. The system 300 receives an audio file 302 in a signal classifier 304 to develop a probabilistic representation to analyze the audio file 302. The signal classifier 304 processes an input audio file (e.g., audio file 302) to generate a content portion list 318 that identifies classified audio portions and associated time intervals. In FIG. 3, the signal classifier 304 compares the audio file 302 to sample audio representations, including a sample dialogue representation 306, a sample music representation 308, a sample dialogue and music representation 310, and a sample silence representation 312. The signal classifier 304 processes these sample representations using a machine learning module 314 to generate data for a comparator 316 to compare against the audio file 302. Using the comparison generated from the comparator 316, the signal classifier 304 generates the content portion list 318 associating timing codes 320 with audio types 322 for each portion of the audio file 302.

In some implementations, the signal classifier 304 identifies portions of the audio file 302 as corresponding to one or more particular audio types. For example, the signal classifier 304 can identify the portions of the audio signal having audio types as including a speech type. However, the audio types that do not include a speech type can still include data with spoken language. For example, a portion with an action type, where explosions, gunfire, car chase sounds, or other action sounds predominate the portion, may still have spoken language. In some implementations, the signal classifier 304 is trained offline using representative audio signal files and stores a compact probabilistic representation of each file to compare against the audio file 302.

In some implementations, the portions have multiple types or sub-classifications to identify to the system whether or not a portion can be classified for speech-to-text processing. For example, the sample dialogue and music representation 310 can provide a comparison file that indicates to the classifier that a portion has both dialogue and lyrics that can be processed using speech-to-text processing.

As shown in FIG. 3, the content portion list 318 may only list the portion as having the dialogue type in the audio types 322 for a portion with both music and dialogue to designate that the portion should be queued for speech-to-text processing. Alternatively, the system 300 can list both types in the content portion list 318 for the processors. In some implementations, types that are not specific to speech designate a portion for further digital signal processing to enhance speech within the portion. Likewise, the system 300 can perform digital signal processing to reduce artifacts within the portion that may generate false positives in speech-to-text processing.

FIG. 4 is a diagram representing an example system 400 for generating a time-ordered classification of portions of an audio file. A content portion list 402 provides timing codes and audio types for each portion of the audio file. The timing code indicates the position of a respective portion within the audio file. The system 400 filters portions of the audio file that are classified as having dialogue types (404-420). The system identifies the dialogue portions (404-420) to generate a queue 422. In some other implementations, the queue includes portions identified as belonging to other speech types in addition to a dialogue type. The queue can determine processor job descriptors for portions of the audio file for parallel processing. Alternatively, the queue can involve transferring the portion from the audio file to store in another location, copying the portion from the audio file to transfer the copy to another location, or storing a pointer to the portion in the audio file.

In some implementations, the time-ordered classification includes determining a time interval for each portion of the identified portions. For example, the time-ordered classification in FIG. 4 displays a nearly equal distribution of the processor job descriptors for the portions over the time in the audio file, e.g., dialogue from T05 to T06, silence from T06 to T07, and dialogue from T07 to T08. In some implementations, the portions are partitioned according to the amount of data for processing, for example, by the amount of dialogue estimated in a particular partition. For example, if an audio file has a very fast speaker and a very slow speaker, the portions with the slow speaker may be for longer periods of time than the portions with the fast speaker.

In some implementations, the queue 422 includes generating portion descriptors for each portion. The portion descriptors can include metadata associated with amount of time and classifiers associated with the portions of the audio file. For example, the time-ordered classification can be the starting point and ending point of each portion, e.g., two byte offset values. Likewise, the metadata can include audio type data, density of dialogue data, and an indicator of the order that the portion belongs in the audio file.
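One possible shape for a portion descriptor is sketched below. The field names, the `PortionDescriptor` class, and the helper `time_to_byte_offset` are hypothetical; the byte-offset calculation assumes uncompressed PCM audio with a fixed frame size, which the description does not require.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PortionDescriptor:
    order: int                    # position of the portion within the audio file
    start_byte: int               # starting point as a byte offset
    end_byte: int                 # ending point as a byte offset
    audio_types: Tuple[str, ...]  # e.g. ("dialogue",) or ("dialogue", "music")
    dialogue_density: float       # assumed metric, e.g. estimated words per second


def time_to_byte_offset(
    seconds: float,
    sample_rate: int = 16000,   # assumed PCM parameters
    channels: int = 1,
    bytes_per_sample: int = 2,
) -> int:
    """Convert a time code to a frame-aligned byte offset in raw PCM data."""
    frame = int(round(seconds * sample_rate))
    return frame * channels * bytes_per_sample
```

A descriptor of this kind is small enough to send to a remote processor, which can then read only the identified byte range from the audio file.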

FIG. 5 is a diagram representing an example system 500 for distributing processor job descriptors for portions of an audio file 514 to processors for speech-to-text processing. The system distributes processor job descriptors for portions of the audio file 514 from a queue 502 to a multi-core platform 504 amongst processors (506-512) for speech-to-text processing. The queue 502 distributes the processor job descriptors for the portions of the audio file 514 for processing. Each of the processors (506-512) can execute its own transcription of portion descriptors. In some implementations, the system 500 partitions the queue 502 into units based on a specified amount of time. The system 500 can partition the units in substantially equal amounts of time to each of the processors (506-512).
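One way to assign units so that each processor receives a substantially equal amount of time is a greedy longest-first heuristic, sketched below. This is a commonly used load-balancing technique swapped in for illustration, not a method stated in the description; `partition_by_time` is a hypothetical name.

```python
import heapq
from typing import List


def partition_by_time(durations: List[float], n_processors: int) -> List[List[int]]:
    """Assign unit indices to processors, balancing total duration.

    Greedy heuristic: take units longest first and give each one to the
    processor with the smallest load so far.
    """
    heap = [(0.0, proc) for proc in range(n_processors)]  # (load, processor)
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(n_processors)]
    by_length = sorted(range(len(durations)), key=lambda i: -durations[i])
    for idx in by_length:
        load, proc = heapq.heappop(heap)
        assignment[proc].append(idx)
        heapq.heappush(heap, (load + durations[idx], proc))
    return assignment
```

The returned indices refer back to positions in the queue, so the original order of the portions remains available for the later merge.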

In some implementations, the system 500 distributes processor job descriptors for the portions of the audio file in the queue 502 to the processors in a first-in-first-out order. Alternatively, the system 500 can distribute the processor job descriptors for the portions of the queue in a last-in-first-out order. In some implementations, the system 500 applies a processor sharing order, where network capacity is shared between the processor job descriptors of the portions of the audio signal so that the processor job descriptors effectively experience the same delay, or a priority order, where processor job descriptors with high priority are served first. For example, if the system estimates that a first processor job descriptor has more speech data than a second processor job descriptor, the system can transmit the first processor job descriptor first to clear the larger portions of data for speech-to-text processing.
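The first-in-first-out, last-in-first-out, and priority orders can be compared with a small sketch. The `dispatch_order` function and the use of estimated speech seconds as the priority are assumptions for illustration; processor sharing is omitted because it concerns how capacity is divided rather than the order of dispatch.

```python
import heapq
from typing import List, Tuple


def dispatch_order(jobs: List[Tuple[str, float]], policy: str = "fifo") -> List[str]:
    """Return job ids in dispatch order.

    jobs: (job_id, estimated_speech_seconds) pairs in queue order.
    """
    if policy == "fifo":
        return [job_id for job_id, _ in jobs]
    if policy == "lifo":
        return [job_id for job_id, _ in reversed(jobs)]
    if policy == "priority":
        # Largest estimate first; the sequence number keeps ties in queue order.
        heap = [(-estimate, seq, job_id) for seq, (job_id, estimate) in enumerate(jobs)]
        heapq.heapify(heap)
        return [heapq.heappop(heap)[2] for _ in range(len(heap))]
    raise ValueError(f"unknown policy: {policy}")
```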

In some implementations, the system 500 distributes the processor job descriptors for portions of the queue 502 before the generated queue 502 is completed. For example, the first processor job descriptors classified can be part of a first-in-first-out system. The system 500, using a first-in-first-out system, can implement multi-core processing as the system 500 identifies portions of the audio file 514 for the queue 502. In some implementations, the system 500 generates the queue 502 in partitions. For example, the system 500 can generate a first queue partition from a first partition of the audio file to allow the system 500 to distribute the processor job descriptors for portions of the audio file in the first queue partition to processors before generating a second queue partition. Alternatively, the system 500 can generate the second queue partition and distribute processor job descriptors for portions of the audio file from the first partition concurrently.
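Distributing jobs before the queue is complete can be sketched by submitting each portion to a worker pool as soon as it is classified. In this illustration `classify_stream` and `transcribe` are placeholders for a real classifier and speech-to-text engine, and the dictionary layout of a portion is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, Iterator, List


def classify_stream(portions: Iterable[Dict]) -> Iterator[Dict]:
    """Placeholder classifier: yields portions one at a time as identified."""
    for portion in portions:
        yield portion


def transcribe(portion: Dict) -> str:
    """Placeholder for a real speech-to-text engine."""
    return f"text[{portion['start']}-{portion['end']}]"


def process_while_classifying(portions: Iterable[Dict], workers: int = 2) -> List[str]:
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for portion in classify_stream(portions):
            if portion["type"] == "dialogue":
                # Submit immediately; workers start while classification continues.
                futures.append(pool.submit(transcribe, portion))
    # Futures were appended in audio-file order, so results are too.
    return [future.result() for future in futures]
```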

FIG. 6 is a diagram representing an example system 600 for distributing processor job descriptors for portions of an audio file to processors distributed over a network. In particular, the system 600 includes a load balancer 602 receiving requests for processing. The load balancer 602 provides processor job descriptors from queues (604-606) to multi-core platforms (608-610), each with processors (612-626) for speech-to-text processing. The queues (604-606) each have at least a portion of an audio file (628-630) to distribute processor job descriptors to the multi-core platforms (608-610). The multi-core platforms (608-610) transmit text files generated by the processors (612-626) to the load balancer 602.

In some implementations, processor job descriptors for the portions of the audio file (628-630) in the queues (604-606) are distributed to two or more remote locations, each remote location including two or more processors. For example, the load balancer 602 can upload each request and transmit the audio file (628-630) to the multi-core platforms (608-610). In some implementations, one of the queues (604-606) exists in a centralized location, e.g., a load balancer node. For example, the processors (612-626) can read the partitions from the one of the queues (604-606) directly.

The remote locations can be connected to one or more other remote locations. For example, the system 600 can include multiple processing nodes. The system can use various techniques to distribute the processor job descriptors for data for parallel processing. For example, the system can use MapReduce to process data across a large network. The system can provide an upload site for a user to upload the audio file so that the user's client device transmits the audio file and then downloads the completed transcription. In some implementations, a user uploads an audio file through an application on a client device. The client device can transmit the audio file to the load balancer 602 for processing. A central processing node can classify portions of the audio file.

In some implementations, the system 600 exploits all processing cores in the central node. Once the system 600 has completed the processing, the load balancer 602 can transmit a final product to the client device. Alternatively, the client device application can generate the queue before sending processor job descriptors for portions of the audio file to the load balancer 602 for processing. The system 600 can also split the queues (604-606) for processing in locations within the network. The system 600 can process more than one audio file at a time.

In some implementations, the system 600 detects an error that occurred. For example, the error can be a non-recoverable error, e.g., out of disk space for text files. The system 600 can halt processing portions of the audio file (628-630) or merging text files. The text files can be deleted or stored in memory. In some implementations, the system 600 detects an error that can be recoverable, e.g., a transient error during processing a portion of the audio file (628-630). For example, the system 600 can attempt speech-to-text processing for a specified number of attempts. If the attempts all fail, the system 600 can report an error message to a user.
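The handling of recoverable errors can be sketched as a bounded retry loop. The `TransientError` exception class and the `transcribe_with_retry` wrapper are hypothetical names; how a real system distinguishes transient from non-recoverable errors is not specified here.

```python
from typing import Callable, Optional


class TransientError(Exception):
    """Assumed marker for errors that are worth retrying."""


def transcribe_with_retry(
    transcribe: Callable[[object], str],
    portion: object,
    max_attempts: int = 3,
) -> str:
    """Retry speech-to-text on transient errors up to max_attempts times."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return transcribe(portion)
        except TransientError as exc:
            last_error = exc  # recoverable: try again
    # All attempts failed; surface an error that can be reported to the user.
    raise RuntimeError(
        f"speech-to-text failed after {max_attempts} attempts"
    ) from last_error
```

Any other exception type, such as one raised for a full disk, propagates immediately and is not retried.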

FIG. 7 is a diagram representing an example system 700 for merging text files to generate a transcription file. The system 700 distributes processor job descriptors for portions of an audio file 702 from a queue 704 to a multi-core platform 706. The multi-core platform 706 assigns the portions amongst processors (708-714) for speech-to-text processing. The processors (708-714) generate text files 716 from the portions of the queue 704. The system 700 merges each text file 718 using a merger 720 to generate a transcription file 722. In some implementations, the system 700 merges the text files 716 amongst the processors (708-714). The system 700 can store the transcription file 722 in a computer readable medium.

In some implementations, the system 700 performs speech-to-text processing on each portion of the queue 704 to generate a corresponding text file 718. Alternatively, each processor (708-714) can generate a single text file with metadata for the merger 720 to generate the transcription file 722. The text files 716 can be merged in order based on the order that the portions of the audio file 702 occur within the audio file 702. In some implementations, the system 700 merges the text files 718 serially. Alternatively, the system 700 can merge the text files 718 in a parallel system, merging the text files 718 into groups and then merging the groups of text files 718 in a hierarchical fashion.
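The hierarchical merge can be illustrated with a pairwise reduction: adjacent text files are merged into groups, and the groups are merged again until one transcript remains. This sketch runs sequentially for clarity; the merges within each round are independent, so they could be handed to separate processors. The function name and the space separator are illustrative assumptions.

```python
from typing import List


def merge_hierarchical(texts: List[str]) -> str:
    """Merge ordered text fragments pairwise, round by round."""
    groups = list(texts)
    if not groups:
        return ""
    while len(groups) > 1:
        # Each adjacent pair forms one larger group; an odd leftover
        # fragment is carried to the next round unchanged.
        groups = [" ".join(groups[i:i + 2]) for i in range(0, len(groups), 2)]
    return groups[0]
```

Because only adjacent fragments are joined, the final transcript keeps the order in which the portions occur within the audio file.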

FIG. 8 is a block diagram of an exemplary user system architecture 800. The system architecture 800 is capable of hosting an audio processing application that can electronically receive, display, and edit one or more audio signals. The architecture 800 includes one or more processors 802 (e.g., IBM PowerPC, Intel Pentium 4, etc.), one or more display devices 804 (e.g., CRT, LCD), graphics processing units 806 (e.g., NVIDIA GeForce, etc.), a network interface 808 (e.g., Ethernet, FireWire, USB, etc.), input devices 810 (e.g., keyboard, mouse, etc.), and one or more computer-readable mediums 812. These components exchange communications and data using one or more buses 814 (e.g., EISA, PCI, PCI Express, etc.).

The term “computer-readable medium” refers to any medium that participates in providing instructions to a processor 802 for execution. The computer-readable medium 812 further includes an operating system 816 (e.g., Mac OS®, Windows®, Linux, etc.), a network communication module 818, a browser 820 (e.g., Safari®, Microsoft® Internet Explorer, Netscape®, etc.), a digital audio workstation 822, and other applications 824.

The operating system 816 can be multi-user, multiprocessing, multitasking, multithreading, real-time and the like. The operating system 816 performs basic tasks, including but not limited to: recognizing input from input devices 810; sending output to display devices 804; keeping track of files and directories on computer-readable mediums 812 (e.g., memory or a storage device); controlling peripheral devices (e.g., disk drives, printers, etc.); and managing traffic on the one or more buses 814. The network communication module 818 includes various components for establishing and maintaining network connections (e.g., software for implementing communication protocols, such as TCP/IP, HTTP, Ethernet, etc.). The browser 820 enables the user to search a network (e.g., Internet) for information (e.g., digital media items).

The digital audio workstation 822 provides various software components for performing the various functions for amplifying the primarily dominant signal in an audio data file, as described with respect to FIGS. 1-7, including receiving an audio file, analyzing the audio file to identify portions corresponding to audio types, generating time-ordered classifications of the identified portions, generating a queue including the classified portions, distributing processor job descriptors for the portions in the queue to processors to process the portions into text files, and merging the text files in the order of the portions in the audio file to generate a transcription file. The digital audio workstation can receive inputs and provide outputs through an audio input/output device 626.
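As a non-limiting illustration of the queue-generation function listed above, a time-ordered classification can be turned into a first-in-first-out queue of jobs as sketched below. The names Portion, SPEECH_TYPES, and build_queue and the dictionary layout of a job are hypothetical; the sketch assumes that "dialogue" is the only audio type treated as a speech type and that each classified portion is described by a start time, an end time, and an audio type.

```python
from collections import deque
from dataclasses import dataclass

# Assumption: only "dialogue" counts as a speech type in this sketch.
SPEECH_TYPES = frozenset({"dialogue"})


@dataclass(frozen=True)
class Portion:
    """Hypothetical classified portion of the audio file."""
    start: float     # start time in seconds
    end: float       # end time in seconds
    audio_type: str  # e.g., "dialogue", "applause", "music", "silence"


def build_queue(portions, speech_types=SPEECH_TYPES):
    """Order portions by time and queue one job per speech portion."""
    ordered = sorted(portions, key=lambda p: p.start)
    queue = deque()
    for position, portion in enumerate(ordered):
        if portion.audio_type in speech_types:
            # The job identifies the portion; it does not copy audio data.
            queue.append({
                "position": position,
                "start": portion.start,
                "end": portion.end,
                "audio_type": portion.audio_type,
            })
    return queue
```

Removing jobs with popleft() yields them in first-in-first-out order, and the recorded position allows the resulting text files to be merged in the order in which the portions occur within the audio file.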

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, data processing apparatus. The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that generates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus.

A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specifics, these should not be construed as limitations on the scope of the invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of the invention. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the invention have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results.

What is claimed is:
 1. A computer-implemented method comprising: receiving an audio file; analyzing the audio file to identify portions of the audio file as corresponding to one or more audio types, the one or more audio types including one or more speech types; generating a time-ordered classification of the identified portions, the time-ordered classification indicating the one or more audio types and position within the audio file of each respective portion; generating a queue using the time-ordered classification, the queue including a plurality of jobs where each job includes one or more identifiers of a respective portion of the audio file classified as belonging to the one or more speech types; distributing the jobs in the queue to a plurality of processors for speech-to-text processing of the corresponding portion of the audio file; performing speech-to-text processing on each portion to generate a corresponding text file; and merging the corresponding text files to generate a transcription file, the text files merged in order based on the order in which the portions of the audio file occur within the audio file.
 2. The method of claim 1, where the distribution of the processor job descriptors for the portions of the audio file in the queue occurs in a first-in-first-out order.
 3. The method of claim 1, where one or more of the jobs in the queue are distributed before the generated queue is completed.
 4. The method of claim 3, where merging the text files is divided recursively amongst the plurality of processors.
 5. The method of claim 1, where the plurality of processors are distributed amongst one or more computing devices.
 6. The method of claim 1, where the jobs in the queue are distributed to two or more remote locations, each remote location including a plurality of processors, each remote location connected to one or more other remote locations.
 7. The method of claim 1, where the one or more audio types includes data with spoken language.
 8. The method of claim 1, where distributing jobs in the queue to the plurality of processors further comprises: partitioning the queue into units based on a specified amount of time; and partitioning the units in substantially equal amounts of time to each of the plurality of processors.
 9. The method of claim 1, where distributing jobs in the queue for the portions of the audio file in the queue to the plurality of processors further comprises: partitioning the queue into units based on a specified amount of data; and partitioning the units in substantially equal amounts of data to each of the plurality of processors.
 10. The method of claim 1, where a client device identifies and classifies the portions of the audio file into the one or more audio types within the audio file.
 11. The method of claim 1, further comprising: partitioning the received audio file for identification and classification using a plurality of classifiers.
 12. The method of claim 11, where the plurality of classifiers comprise one or more of the following classifiers: dialogue, applause, music, silence, and ambient noise.
 13. The method of claim 1, further comprising: storing the transcription file.
 14. The method of claim 1, where generating the queue includes generating portion descriptors for each portion.
 15. The method of claim 14, where the portion descriptors comprise metadata associated with amount of time and one or more classifiers associated with the portions of the audio file.
 16. The method of claim 1, where the time-ordered classification includes determining a time interval for each portion of the identified portions.
 17. A computer program product, encoded on a computer-readable medium, operable to cause data processing apparatus to perform operations comprising: receiving an audio file; analyzing the audio file to identify portions of the audio file as corresponding to one or more audio types, the one or more audio types including one or more speech types; generating a time-ordered classification of the identified portions, the time-ordered classification indicating the one or more audio types and position within the audio file of each respective portion; generating a queue using the time-ordered classification, the queue including a plurality of jobs where each job includes one or more identifiers of a respective portion of the audio file classified as belonging to the one or more speech types; distributing the jobs in the queue to a plurality of processors for speech-to-text processing of the corresponding portion of the audio file; performing speech-to-text processing on each portion to generate a corresponding text file; and merging the corresponding text files to generate a transcription file, the text files merged in order based on the order in which the portions of the audio file occur within the audio file.
 18. The computer program product of claim 17, where the distribution of the processor job descriptors for the portions of the audio file in the queue occurs in a first-in-first-out order.
 19. The computer program product of claim 17, where one or more of the jobs in the queue are distributed before the generated queue is completed.
 20. The computer program product of claim 19, where merging the text files is divided recursively amongst the plurality of processors.
 21. The computer program product of claim 17, where the plurality of processors are distributed amongst one or more computing devices.
 22. The computer program product of claim 17, where the jobs in the queue are distributed to two or more remote locations, each remote location including a plurality of processors, each remote location connected to one or more other remote locations.
 23. The computer program product of claim 17, where the one or more audio types includes data with spoken language.
 24. The computer program product of claim 17, where distributing jobs in the queue to the plurality of processors further comprises: partitioning the queue into units based on a specified amount of time; and partitioning the units in substantially equal amounts of time to each of the plurality of processors.
 25. The computer program product of claim 17, where distributing jobs in the queue for the portions of the audio file in the queue to the plurality of processors further comprises: partitioning the queue into units based on a specified amount of data; and partitioning the units in substantially equal amounts of data to each of the plurality of processors.
 26. The computer program product of claim 17, where a client device identifies and classifies the portions of the audio file into the one or more audio types within the audio file.
 27. The computer program product of claim 17, further operable to perform operations comprising: partitioning the received audio file for identification and classification using a plurality of classifiers.
 28. The computer program product of claim 27, where the plurality of classifiers comprise one or more of the following classifiers: dialogue, applause, music, silence, and ambient noise.
 29. The computer program product of claim 17, further operable to perform operations comprising: storing the transcription file.
 30. The computer program product of claim 17, where generating the queue includes generating portion descriptors for each portion.
 31. The computer program product of claim 30, where the portion descriptors comprise metadata associated with amount of time and one or more classifiers associated with the portions of the audio file.
 32. The computer program product of claim 17, where the time-ordered classification includes determining a time interval for each portion of the identified portions.
 33. A system comprising: a processor and a memory operable to perform operations including: analyzing the audio file to identify portions of the audio file as corresponding to one or more audio types, the one or more audio types including one or more speech types; generating a time-ordered classification of the identified portions, the time-ordered classification indicating the one or more audio types and position within the audio file of each respective portion; generating a queue using the time-ordered classification, the queue including a plurality of jobs where each job includes one or more identifiers of a respective portion of the audio file classified as belonging to the one or more speech types; distributing the jobs in the queue to a plurality of processors for speech-to-text processing of the corresponding portion of the audio file; performing speech-to-text processing on each portion to generate a corresponding text file; and merging the corresponding text files to generate a transcription file, the text files merged in order based on the order in which the portions of the audio file occur within the audio file.
 34. The system of claim 33, where the distribution of the processor job descriptors for the portions of the audio file in the queue occurs in a first-in-first-out order.
 35. The system of claim 33, where one or more of the jobs in the queue are distributed before the generated queue is completed.
 36. The system of claim 35, where merging the text files is divided recursively amongst the plurality of processors.
 37. The system of claim 33, where the plurality of processors are distributed amongst one or more computing devices.
 38. The system of claim 33, where the jobs in the queue are distributed to two or more remote locations, each remote location including a plurality of processors, each remote location connected to one or more other remote locations.
 39. The system of claim 33, where the one or more audio types includes data with spoken language.
 40. The system of claim 33, where distributing jobs in the queue to the plurality of processors further comprises: partitioning the queue into units based on a specified amount of time; and partitioning the units in substantially equal amounts of time to each of the plurality of processors.
 41. The system of claim 33, where distributing jobs in the queue for the portions of the audio file in the queue to the plurality of processors further comprises: partitioning the queue into units based on a specified amount of data; and partitioning the units in substantially equal amounts of data to each of the plurality of processors.
 42. The system of claim 33, where a client device identifies and classifies the portions of the audio file into the one or more audio types within the audio file.
 43. The system of claim 33, further operable to perform operations comprising: partitioning the received audio file for identification and classification using a plurality of classifiers.
 44. The system of claim 43, where the plurality of classifiers comprise one or more of the following classifiers: dialogue, applause, music, silence, and ambient noise.
 45. The system of claim 33, further operable to perform operations comprising: storing the transcription file.
 46. The system of claim 33, where generating the queue includes generating portion descriptors for each portion.
 47. The system of claim 46, where the portion descriptors comprise metadata associated with amount of time and one or more classifiers associated with the portions of the audio file.
 48. The system of claim 33, where the time-ordered classification includes determining a time interval for each portion of the identified portions.
 49. A method performed by a computer programmed to provide speech-to-text processing, the method comprising: receiving an audio file; analyzing the audio file to identify portions of the audio file as corresponding to one or more audio types, the one or more audio types including one or more speech types; generating a time-ordered classification of the identified portions, the time-ordered classification indicating the one or more audio types and position within the audio file of each respective portion; generating a queue using the time-ordered classification, the queue including a plurality of jobs where each job includes one or more identifiers of a respective portion of the audio file classified as belonging to the one or more speech types; distributing the jobs in the queue to a plurality of processors for speech-to-text processing of the corresponding portion of the audio file; performing speech-to-text processing on each portion to generate a corresponding text file; and merging the corresponding text files to generate a transcription file, the text files merged in order based on the order in which the portions of the audio file occur within the audio file.